kv-cache : prepare K/V buffers for separation #14517
Closed
Conversation
Force-pushed from 2a738fe to 40f8c48
compilade reviewed Jul 3, 2025
src/llama-kv-cache-unified.cpp (Outdated), comment on lines 72 to 75
Collaborator
I think OpenELM is a model family which needs this, see #7359
Member (Author)
This is now fixed with the latest commit.
Force-pushed from 386425f to 886da0a
Contributor
ref: sounds related to #10860
Member (Author)
#14363 is more relevant. This PR is a standalone preparation step that I extracted to make the final PR easier to review.
Member (Author)
Will merge #14363 directly when ready.
from #14363
Currently, the K and V buffers in the unified KV cache are shared among all the participating sequences (hence the name "unified"). With the upcoming change #14363, the buffers can become separate from each other in order to increase the throughput for parallel decoding use cases. This PR is a preparation step to support that.
There should be no functional changes.
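For readers outside the codebase, here is a rough conceptual sketch of the two layouts (plain C++ with hypothetical types; these are not llama.cpp's actual data structures):

```cpp
#include <vector>

// Unified layout: one K buffer and one V buffer whose cells are shared
// by all participating sequences.
struct kv_buffers_unified {
    std::vector<float> k;
    std::vector<float> v;
};

// Separated layout (the direction of #14363): each stream gets its own
// K/V pair, so parallel sequences no longer compete for cells in a
// single shared buffer.
struct kv_buffers_separate {
    std::vector<kv_buffers_unified> streams; // one K/V pair per stream
};
```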
Handling of variable V heads is also done when `ggml_set_rows()` is used:

```sh
LLAMA_SET_ROWS=1 ./bin/llama-cli -hf mradermacher/OpenELM-3B-Instruct-GGUF:Q8_0 \
    -p "I believe the meaning of life is" -no-cnv -n 32 -t 1 -s 2 --top-k 1
```
The only new restriction is that we require the number of KV heads for all layers to be equal:
https://github.com/ggml-org/llama.cpp/blob/40f8c4830a0a927adf448c3ded96129b9823c90f/src/llama-kv-cache-unified.cpp#L70-L77
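As a rough illustration of what the restriction amounts to, a minimal sketch of an equal-KV-heads check (the `hparams_stub` type is a hypothetical stand-in; this is not the exact code behind the link above):

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the per-layer hyperparameters; in llama.cpp
// the values come from llama_hparams.
struct hparams_stub {
    std::vector<uint32_t> n_head_kv_arr; // KV head count per layer

    uint32_t n_layer() const { return (uint32_t) n_head_kv_arr.size(); }
    uint32_t n_head_kv(uint32_t il) const { return n_head_kv_arr[il]; }
};

// Sketch of the new restriction: every layer must report the same number
// of KV heads as layer 0, otherwise the unified cache refuses to build.
void check_equal_kv_heads(const hparams_stub & hp) {
    const uint32_t ref = hp.n_head_kv(0);
    for (uint32_t il = 1; il < hp.n_layer(); ++il) {
        if (hp.n_head_kv(il) != ref) {
            throw std::runtime_error("non-equal n_head_kv across layers is not supported");
        }
    }
}
```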
Support for a varying number of KV heads should be simple: we just need to make the correct view of `v_idxs` when FA is disabled. But this is left for when we actually need it.
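For context on the `v_idxs` remark: with FA disabled the V cache is stored transposed, so each token's destination cell expands into one index per V element, and the per-layer V embedding size enters the index computation. A hedged illustration of that bookkeeping in plain C++ (hypothetical names, not the actual llama.cpp code):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: compute destination indices into a transposed
// V buffer (element-granular layout used by the non-FA path). Each
// token contributes n_embd_v indices, one per V element, offset
// column-wise by its destination cell.
std::vector<int64_t> make_v_idxs_transposed(
        const std::vector<int64_t> & cells, // destination cell per token
        int64_t n_embd_v,                   // V embedding size (depends on the layer's KV heads)
        int64_t n_kv) {                     // total number of cache cells
    std::vector<int64_t> idxs;
    idxs.reserve(cells.size() * n_embd_v);
    for (int64_t c : cells) {
        for (int64_t e = 0; e < n_embd_v; ++e) {
            idxs.push_back(e*n_kv + c); // column-major offset in the transposed buffer
        }
    }
    return idxs;
}
```

With equal KV heads across layers, a single `n_embd_v` serves every layer; with varying heads, each layer would need its own correctly sized view of these indices, which is the adjustment deferred above.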